Purpose of workshop

Exploring the normality of numerical columns in a novel data set and producing publication quality tables and reports


Objectives

  1. Using summary statistics to better understand individual columns in a data set.
  2. Assessing data normality in numerical columns.
  3. Producing a publishable HTML with summary statistics and normality tests for columns within a data set.

Required setup

We first need to prepare our environment with the necessary packages

options(repos = list(CRAN = "http://cran.rstudio.com/"))
options(digits = 2)

install.packages("pacman")

pacman::p_load(dlookr, # Exploratory data analysis
       formattable, # HTML tables from R outputs
       kableExtra, # Alternative to formattable
       knitr, # Needed to write HTML reports
       nycflights13, # Holds the flights data set
       tidyverse) # Powerful data wrangling package suite

1.0 Load and examine a data set

# Let's load a data set from the flights data set
data("flights")

# What does the data look like?
formattable(head(flights))
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5 15 2013-01-01 05:00:00
2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5 29 2013-01-01 05:00:00
2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089 5 40 2013-01-01 05:00:00
2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 183 1576 5 45 2013-01-01 05:00:00
2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL 116 762 6 0 2013-01-01 06:00:00
2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD 150 719 5 58 2013-01-01 05:00:00

1.1 Diagnose your data

# What are the properties of the data
formattable(diagnose(flights))
variables types missing_count missing_percent unique_count unique_rate
year integer 0 0.00 1 3.0e-06
month integer 0 0.00 12 3.6e-05
day integer 0 0.00 31 9.2e-05
dep_time integer 8255 2.45 1319 3.9e-03
sched_dep_time integer 0 0.00 1021 3.0e-03
dep_delay numeric 8255 2.45 528 1.6e-03
arr_time integer 8713 2.59 1412 4.2e-03
sched_arr_time integer 0 0.00 1163 3.5e-03
arr_delay numeric 9430 2.80 578 1.7e-03
carrier character 0 0.00 16 4.8e-05
flight integer 0 0.00 3844 1.1e-02
tailnum character 2512 0.75 4044 1.2e-02
origin character 0 0.00 3 8.9e-06
dest character 0 0.00 105 3.1e-04
air_time numeric 9430 2.80 510 1.5e-03
distance numeric 0 0.00 214 6.4e-04
hour numeric 0 0.00 20 5.9e-05
minute numeric 0 0.00 60 1.8e-04
time_hour POSIXct 0 0.00 6936 2.1e-02
  • variables: name of each variable
  • types: data type of each variable
  • missing_count: number of missing values
  • missing_percent: percentage of missing values
  • unique_count: number of unique values
  • unique_rate: rate of unique value - unique_count / number of observations

Box plot

(c) Cédric Scherer


Skewness

(c) Andrey Akinshin


NOTE

  • “Skewness” has multiple definitions. Several underlying equations mey be at play
  • Skewness is “designed” for distributions with one peak (unimodal); it’s meaningless for distributions with multiple peaks (multimodal).
  • Most default skewness definitions are not robust: a single outlier could completely distort the skewness value.
  • We can’t make conclusions about the locations of the mean and the median based on the skewness sign.

Kurtosis

(c) Andrey Akinshin


NOTE

  • There are multiple definitions of kurtosis - i.e., “kurtosis” and “excess kurtosis,” but there are other definitions of this measure.
  • Kurtosis may work fine for distributions with one peak (unimodal); it’s meaningless for distributions with multiple peaks (multimodal).
  • The classic definition of kurtosis is not robust: it could be easily spoiled by extreme outliers.

1.2 Describe your continuous data

# Summary statistics 
formattable(describe(flights))
variable n na mean sd se_mean IQR skewness kurtosis p00 p01 p05 p10 p20 p25 p30 p40 p50 p60 p70 p75 p80 p90 p95 p99 p100
year 336776 0 2013.0 0.0 0.0000 0 NaN NaN 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013
month 336776 0 6.5 3.4 0.0059 6 -0.01340 -1.19 1 1 1 2 3 4 4 5 7 8 9 10 10 11 12 12 12
day 336776 0 15.7 8.8 0.0151 15 0.00774 -1.19 1 1 2 4 7 8 10 13 16 19 22 23 25 28 29 31 31
dep_time 328521 8255 1349.1 488.3 0.8519 837 -0.02474 -1.09 1 551 624 703 827 907 1001 1200 1401 1536 1700 1744 1830 2008 2112 2251 2400
sched_dep_time 336776 0 1344.3 467.3 0.8053 823 -0.00586 -1.20 106 600 630 705 830 906 1000 1200 1359 1529 1655 1729 1815 1945 2050 2225 2359
dep_delay 328521 8255 12.6 40.2 0.0702 16 4.80254 43.95 -43 -12 -9 -7 -6 -5 -4 -3 -2 0 6 11 18 49 88 191 1301
arr_time 328063 8713 1502.1 533.3 0.9310 836 -0.46782 -0.19 1 22 736 853 1023 1104 1154 1338 1535 1725 1854 1940 2026 2159 2248 2345 2400
sched_arr_time 336776 0 1536.4 497.5 0.8572 821 -0.35314 -0.38 1 40 815 917 1040 1124 1214 1400 1556 1738 1900 1945 2030 2200 2246 2353 2359
arr_delay 327346 9430 6.9 44.6 0.0780 31 3.71682 29.23 -86 -44 -32 -26 -19 -17 -14 -10 -5 1 9 14 21 52 91 190 1272
flight 336776 0 1971.9 1632.5 2.8130 2912 0.66160 -0.85 1 11 91 209 421 553 706 1115 1496 1903 2889 3465 3830 4471 4695 5736 8500
air_time 327346 9430 150.7 93.7 0.1638 110 1.07071 0.86 20 33 40 47 71 82 93 112 129 146 167 192 214 319 339 364 695
distance 336776 0 1039.9 733.2 1.2635 887 1.12869 1.19 17 169 199 214 427 502 544 733 872 1023 1096 1389 1598 2446 2475 2586 4983
hour 336776 0 13.2 4.7 0.0080 8 -0.00054 -1.21 1 6 6 7 8 9 10 12 13 15 16 17 18 19 20 22 23
minute 336776 0 26.2 19.3 0.0333 36 0.09293 -1.24 0 0 0 0 5 8 11 20 29 30 40 44 45 55 58 59 59
  • n : number of observations excluding missing values
  • na : number of missing values
  • mean : arithmetic average
  • sd : standard deviation
  • se_mean : standard error mean. sd/sqrt(n)
  • IQR : interquartile range (Q3-Q1)
  • skewness : skewness
  • kurtosis : kurtosis
  • p25 : Q1. 25% percentile
  • p50 : median. 50% percentile
  • p75 : Q3. 75% percentile
  • p01, p05, p10, p20, p30 : 1%, 5%, 20%, 30% percentiles
  • p40, p60, p70, p80 : 40%, 60%, 70%, 80% percentiles
  • p90, p95, p99, p100 : 90%, 95%, 99%, 100% percentiles

1.3 Describe your continuous data: refined

The above is pretty overwhelming, and most people don’t care about percentiles outside of Q1, Q3, and the median (Q2).

# Summary statistics, selecting the desired ones
RefinedTable <- flights %>%
  describe() %>%
  select(variable, n, na, mean, sd, se_mean, IQR, skewness, kurtosis, p25, p50, p75)

formattable(RefinedTable)
variable n na mean sd se_mean IQR skewness kurtosis p25 p50 p75
year 336776 0 2013.0 0.0 0.0000 0 NaN NaN 2013 2013 2013
month 336776 0 6.5 3.4 0.0059 6 -0.01340 -1.19 4 7 10
day 336776 0 15.7 8.8 0.0151 15 0.00774 -1.19 8 16 23
dep_time 328521 8255 1349.1 488.3 0.8519 837 -0.02474 -1.09 907 1401 1744
sched_dep_time 336776 0 1344.3 467.3 0.8053 823 -0.00586 -1.20 906 1359 1729
dep_delay 328521 8255 12.6 40.2 0.0702 16 4.80254 43.95 -5 -2 11
arr_time 328063 8713 1502.1 533.3 0.9310 836 -0.46782 -0.19 1104 1535 1940
sched_arr_time 336776 0 1536.4 497.5 0.8572 821 -0.35314 -0.38 1124 1556 1945
arr_delay 327346 9430 6.9 44.6 0.0780 31 3.71682 29.23 -17 -5 14
flight 336776 0 1971.9 1632.5 2.8130 2912 0.66160 -0.85 553 1496 3465
air_time 327346 9430 150.7 93.7 0.1638 110 1.07071 0.86 82 129 192
distance 336776 0 1039.9 733.2 1.2635 887 1.12869 1.19 502 872 1389
hour 336776 0 13.2 4.7 0.0080 8 -0.00054 -1.21 9 13 17
minute 336776 0 26.2 19.3 0.0333 36 0.09293 -1.24 8 29 44

1.4 Describe categorical variables

formattable(diagnose_category(flights))
variables levels N freq ratio rank
carrier UA 336776 58665 17.420 1
carrier B6 336776 54635 16.223 2
carrier EV 336776 54173 16.086 3
carrier DL 336776 48110 14.285 4
carrier AA 336776 32729 9.718 5
carrier MQ 336776 26397 7.838 6
carrier US 336776 20536 6.098 7
carrier 9E 336776 18460 5.481 8
carrier WN 336776 12275 3.645 9
carrier VX 336776 5162 1.533 10
tailnum NA 336776 2512 0.746 1
tailnum N725MQ 336776 575 0.171 2
tailnum N722MQ 336776 513 0.152 3
tailnum N723MQ 336776 507 0.151 4
tailnum N711MQ 336776 486 0.144 5
tailnum N713MQ 336776 483 0.143 6
tailnum N258JB 336776 427 0.127 7
tailnum N298JB 336776 407 0.121 8
tailnum N353JB 336776 404 0.120 9
tailnum N351JB 336776 402 0.119 10
origin EWR 336776 120835 35.880 1
origin JFK 336776 111279 33.042 2
origin LGA 336776 104662 31.078 3
dest ORD 336776 17283 5.132 1
dest ATL 336776 17215 5.112 2
dest LAX 336776 16174 4.803 3
dest BOS 336776 15508 4.605 4
dest MCO 336776 14082 4.181 5
dest CLT 336776 14064 4.176 6
dest SFO 336776 13331 3.958 7
dest FLL 336776 12055 3.580 8
dest MIA 336776 11728 3.482 9
dest DCA 336776 9705 2.882 10
time_hour 2013-09-13 08:00:00 336776 94 0.028 1
time_hour 2013-09-20 08:00:00 336776 94 0.028 1
time_hour 2013-09-09 08:00:00 336776 93 0.028 3
time_hour 2013-09-16 08:00:00 336776 93 0.028 3
time_hour 2013-09-23 08:00:00 336776 93 0.028 3
time_hour 2013-09-19 08:00:00 336776 92 0.027 6
time_hour 2013-10-11 08:00:00 336776 92 0.027 6
time_hour 2013-09-10 08:00:00 336776 91 0.027 8
time_hour 2013-09-12 08:00:00 336776 91 0.027 8
time_hour 2013-09-17 08:00:00 336776 91 0.027 8
  • variables: category names
  • levels: group names within categories
  • N: number of observation
  • freq: number of observation at group level / number of observation at category level
  • ratio: percentage of observation at group level / number of observation at category level
  • rank: rank of the occupancy ratio of levels (order in which the groups are in the category)

1.5 Group descriptive statistics

GroupTable <- flights %>%
  group_by(carrier) %>%
  select(carrier, dep_delay, arr_delay, air_time) %>%
  describe() %>%
  select(variable, carrier, n, na, mean, sd, se_mean, IQR, skewness, kurtosis, p25, p50, p75)

formattable(GroupTable)
variable carrier n na mean sd se_mean IQR skewness kurtosis p25 p50 p75
air_time 9E 17294 1166 86.78 43 0.33 61 0.763 0.083 51 83.0 112.0
air_time AA 31947 782 188.82 82 0.46 81 0.657 -0.285 134 169.0 215.0
air_time AS 709 5 325.62 16 0.61 22 0.448 0.429 314 324.0 336.0
air_time B6 54049 586 151.18 90 0.39 120 0.776 -0.091 67 142.0 187.0
air_time DL 47658 452 173.69 85 0.39 101 0.882 -0.407 115 145.0 216.0
air_time EV 51108 3065 90.08 40 0.18 63 0.533 -0.410 53 87.0 116.0
air_time F9 681 4 229.60 15 0.58 21 0.323 -0.244 218 229.0 239.0
air_time FL 3175 85 101.14 24 0.42 46 -0.674 -0.892 72 109.0 118.0
air_time HA 342 0 623.09 21 1.12 27 0.441 0.258 608 621.5 635.0
air_time MQ 25037 1360 91.18 31 0.19 40 0.811 0.562 70 83.0 110.0
air_time OO 29 3 83.48 35 6.53 5 1.834 1.922 67 68.0 72.0
air_time UA 57782 883 211.79 101 0.42 178 0.402 0.053 135 197.0 313.0
air_time US 19831 705 88.57 75 0.53 47 2.080 3.093 42 76.0 89.0
air_time VX 5116 46 337.00 21 0.29 29 -0.071 -0.126 323 337.0 352.0
air_time WN 12044 231 147.82 56 0.51 83 1.033 0.374 110 122.0 193.0
air_time YV 544 57 65.74 20 0.85 36 0.268 -1.374 47 56.5 83.0
arr_delay 9E 17294 1166 7.38 50 0.38 36 2.859 12.432 -21 -7.0 15.0
arr_delay AA 31947 782 0.36 43 0.24 29 4.829 54.697 -21 -9.0 8.0
arr_delay AS 709 5 -9.93 36 1.37 34 2.266 7.525 -32 -17.0 2.0
arr_delay B6 54049 586 9.46 43 0.18 31 2.884 12.279 -14 -3.0 17.0
arr_delay DL 47658 452 1.64 44 0.20 28 5.442 58.944 -20 -8.0 8.0
arr_delay EV 51108 3065 15.80 50 0.22 40 2.599 9.721 -14 -1.0 26.0
arr_delay F9 681 4 21.92 62 2.36 40 5.157 49.213 -9 6.0 31.0
arr_delay FL 3175 85 20.12 54 0.96 31 3.846 21.312 -7 5.0 24.0
arr_delay HA 342 0 -6.92 75 4.06 30 14.624 247.792 -28 -13.0 2.8
arr_delay MQ 25037 1360 10.77 43 0.27 31 4.932 67.610 -13 -1.0 18.0
arr_delay OO 29 3 11.93 49 9.02 22 1.970 3.161 -16 -7.0 6.0
arr_delay UA 57782 883 3.56 41 0.17 30 3.274 16.954 -18 -6.0 12.0
arr_delay US 19831 705 2.13 33 0.23 23 3.861 25.304 -15 -6.0 8.0
arr_delay VX 5116 46 1.76 50 0.70 31 3.927 25.069 -23 -9.0 8.0
arr_delay WN 12044 231 9.65 47 0.43 30 3.235 14.342 -15 -3.0 15.0
arr_delay YV 544 57 15.56 53 2.27 40 2.510 8.701 -16 -2.0 24.2
dep_delay 9E 17416 1044 16.73 46 0.35 23 3.345 15.705 -6 -2.0 17.0
dep_delay AA 32093 636 8.59 37 0.21 10 6.729 89.344 -6 -3.0 4.0
dep_delay AS 712 2 5.80 31 1.18 10 3.955 17.806 -7 -3.0 3.0
dep_delay B6 54169 466 13.02 39 0.17 17 3.695 18.276 -5 -1.0 12.0
dep_delay DL 47761 349 9.26 40 0.18 10 7.340 92.122 -5 -2.0 5.0
dep_delay EV 51356 2817 19.96 47 0.21 30 2.943 11.680 -5 -1.0 25.0
dep_delay F9 682 3 20.22 58 2.23 22 6.371 67.873 -4 0.5 18.0
dep_delay FL 3187 73 18.73 53 0.93 21 4.459 27.084 -4 1.0 17.0
dep_delay HA 342 0 4.90 74 4.01 6 15.940 276.543 -7 -4.0 -1.0
dep_delay MQ 25163 1234 10.55 39 0.25 16 6.374 101.896 -7 -3.0 9.0
dep_delay OO 29 3 12.59 43 8.00 13 2.338 4.851 -9 -6.0 4.0
dep_delay UA 57979 686 12.11 36 0.15 15 4.415 26.382 -4 0.0 11.0
dep_delay US 19873 663 3.78 28 0.20 7 5.658 46.450 -7 -4.0 0.0
dep_delay VX 5131 31 12.87 45 0.63 12 5.332 41.026 -4 0.0 8.0
dep_delay WN 12083 192 17.71 43 0.39 19 3.965 19.869 -2 1.0 17.0
dep_delay YV 545 56 19.00 49 2.11 30 3.105 12.833 -7 -2.0 23.0

2.0 Testing normality

  • Shapiro-Wilk test & Q-Q plots
  • Testing overall normality of two columns
  • Testing normality of groups

2.1 Normality of columns


Shapiro-Wilk test

Shapiro-Wilk test looks at whether a target distribution is sample form a normal distribution

FlightsNorm <- flights %>%
  select(dep_delay, arr_delay, air_time) 

formattable(normality(FlightsNorm))
vars statistic p_value sample
dep_delay 0.52 1.4e-78 5000
arr_delay 0.70 4.8e-69 5000
air_time 0.88 1.7e-51 5000

Q-Q plots

Q-Q plots look at the quartiles of a target data set and plot it against predicted quartiles from a normal distribution

FlightsNorm %>%
plot_normality(dep_delay, arr_delay, air_time)


2.2 Normality within groups

Looking within carrier at the subgroup normality

# Shapiro-Wilk test
NormCarrierTable <- flights %>%
  group_by(carrier) %>%
  select(carrier, dep_delay) %>%
  normality()

formattable(NormCarrierTable)
variable carrier statistic p_value sample
dep_delay 9E 0.59 1.2e-74 5000
dep_delay AA 0.48 8.6e-81 5000
dep_delay AS 0.51 2.3e-40 714
dep_delay B6 0.56 4.9e-77 5000
dep_delay DL 0.37 2.8e-85 5000
dep_delay EV 0.65 1.7e-71 5000
dep_delay F9 0.47 1.1e-40 685
dep_delay FL 0.52 5.8e-69 3260
dep_delay HA 0.12 7.1e-37 342
dep_delay MQ 0.51 9.5e-79 5000
dep_delay OO 0.60 1.2e-07 32
dep_delay UA 0.52 5.8e-79 5000
dep_delay US 0.46 3.8e-81 5000
dep_delay VX 0.43 8.0e-83 5000
dep_delay WN 0.53 2.8e-78 5000
dep_delay YV 0.62 7.0e-33 601
# Q-Q plot
flights %>%
  group_by(carrier) %>%
  select(carrier, dep_delay) %>%
  plot_normality()


3.0 Produce an HTML normality summary of a data set

# Remove the '#' below to reproduce an HTML from an R script. 
#eda_web_report(flights)